PathAnalyser provides functionality for assessing ER and HER2 pathway activity in breast cancer transcriptomic datasets. The package enables classification of samples in transciptomic datasets according to pathway activity by using a gene expression signature associated with a pathway, a list of genes, which vary in expression depending on the pathway activity. These transcriptional signatures have been shown to have clinical predictive value, for example, ER and HER2 gene signatures have been associated with molecular sub-types, prognosis and treatment reponse in breast cancer. Although this package was original developed to assess ER and HER2 pathway activity in breast cancer transcriptomic datasets, the package functionality could be applied to transcriptomic datasets and gene signatures outside the context of breast cancer. In this vignette, we will describe how to use the PathAnalyser package with microarray and RNA-seq expression transcriptomic datasets and gene expression signatures associated with a specific pathway activity.
PathAnalyser 0.0.0.9000
PathAnalyser is an intuitive and user-friendly package that allows users to assess pathway activity of transcriptomic data sets using a pathway gene signature. PathAnalyser uses GSVA to estimate expression abundance of the up-regulated and down-regulated gene sets of the gene signature for each sample in a transcriptomic data set, and then ranks samples based on expression abundance of the up-regulated and down-regulated gene sets of the gene signature separately. The classification algorithm uses user-configurable thresholds to tailor the pathway-based classification to their own classification needs. Unlike other available pathway assessment packages which do not distinguish between up-regulated and down-regulated gene sets, the PathAnalyser classification algorithm considers both parts of a gene signature: the up-regulated and down-regulated genes separately in assessing consistency of expression with the gene signature.
Other features unique to PathAnalyser package include:
Multiple samples are considered for the classification analysis, building upon the sample-wise analysis provided by the GSVA package
Unlike other classification algorithms, PathAnalyser does not directly force classification of samples into active or inactive pathway classes
Detailed evaluation of the pathway-based sample classification including descriptive statistics such as accuracy, percentage classified, recall and ROC curve visualization
A flow chart of the general workflow of the analysis conducted using PathAnalyser is illustrated below:
Figure 1: PathAnalyser workflow
Note: The classification evaluation step is optional and can only be performed if actual pathway activity class labels (e.g. active, inactive or uncertain) for the corresponding pathway are present.
The PathAnalyser package functionality can be sub-divided into 5 broad categories corresponding to a typical workflow for assessing pathway activity of samples:
To install all required dependencies of PathAnalyser, type the following in R:
# If not already installed
install.packages("BiocManager")
BiocManager::install(c("GSVA", "pROC", "edgeR", "reshape2", "ggplot2","limma",
"VennDiagram", "NCmisc", "futile.logger"),
dependencies = TRUE)
The easiest way to install PathAnalyser is to install directly from GitHub using the devtools package in R:
install.packages("devtools")
devtools::install_github("ozlemkaradeniz/PathAnalyser")
Alternatively, you can clone the GitHub repository in a Linux / MacOS terminal:
git clone git@github.com:a-thind/PathAnalyser.git
Then type the following in R to install a local source version of the package:
library(utils)
install.packages("~/PathAnalyser", repo=NULL, type="source")
Once installation is completed, load the library to start using PathAnalyser:
library(PathAnalyser)
The classification algorithm of PathAnalyser requires two types of input data:
The two data types described above (gene expression matrix and gene signature dataframe) can be created from two types of input files:
knitr::include_graphics("expr_dataset_example.png")
Figure 2: An example gene expression matrix text file
The first line contains the sample names and the first column contains gene names. All other tab-separated fields represent expression values for each gene for each sample from either RNASeq or microarray experiments.
Note: PathAnalyser does not provide functionality for unnormalized microarray data, so the user must ensure the microarray data is normalized prior to performing classification by PathAnalyser.
knitr::include_graphics("up_gene_signature.png")
Figure 3: An example gene set file format containing either the up- or down-regulated gene sets of the gene signature
Note: read_signature_data from PathAnalyser only accepts gene signatures
that have the up-regulated and down-regulated gene sets in separate gene set
format text files (with file extension .grp). The user is expected to provide one
up-regulated and one down-regulated gene set file for a pathway gene signature.
PathAnalyser contains two gene signatures for active ER and HER2 pathways that can be used to assess ER and HER2 pathway activity in a breast cancer transcriptomic data set. These signatures are accessible by typing the following in R:
data("ER_sig_df")
data("HER2_sig_df")
The built-in ER signature ER_sig_df is a data frame containing the names of
160 differentially expressed genes which constitute the transcriptional
signature of ER pathway activation. This signature was obtained from the
sensitivity to endocrine therapy genome index defined by Symanns et al. (2011).
The total number of genes that are up-regulated (denoted with an expression
value of 1) and down-regulated (denoted with an expression value of -1) in this
gene signature are 101 and 59 genes respectively.
dim(ER_sig_df)
#> [1] 160 2
head(ER_sig_df)
#> gene expression
#> 1 ABAT 1
#> 2 ACADSB 1
#> 3 ADCY1 1
#> 4 ADCY9 1
#> 5 AGR2 1
#> 6 ANXA9 1
# up-regulated genes are given an expression value of 1
ER_up_sig <- ER_sig_df[ER_sig_df$expression == 1 ,]
dim(ER_up_sig)
#> [1] 101 2
# down-regulated genes are given an expression value of -1
ER_dn_sig <- ER_sig_df[ER_sig_df$expression == -1 ,]
dim(ER_dn_sig)
#> [1] 59 2
# for more details
?ER_sig_df
A gene expression signature for HER2 (ERBB2) pathway activation HER2_sig_df is
also provided by PathAnalyser. This gene signature was obtained from Molecular
Signatures Database (MSigDB)
by combining the up-regulated (gene set:
ERBB2_UP.V1_UP
) and down-regulated (gene set:
ERBB2_UP.V1_DN
) gene sets available at MSigDB, originating from transcriptomic profiling of a
MCF-7 breast cancer cell line, positive for the estrogen receptor (ER) but
genetically engineered to express the ERBB2 (HER2) receptor.
The names of 387 differentially expressed genes are in the HER2 signature data frame provided by PathAnalyser, of which 190 are up-regulated (denoted by an expression of 1) and 197 are down-regulated (denoted by an expression of -1).
dim(HER2_sig_df)
#> [1] 387 2
head(HER2_sig_df)
#> gene expression
#> 1 ABCC3 1
#> 2 ABCG1 1
#> 3 ACHE 1
#> 4 AKR1B10 1
#> 5 AKR1C2 1
#> 6 ALDH4A1 1
# up-regulated genes are given an expression value of 1
HER2_up_sig <- HER2_sig_df[HER2_sig_df$expression == 1 ,]
dim(HER2_up_sig)
#> [1] 190 2
# Down-regulated genes are given an expression value of -1
HER2_dn_sig <- HER2_sig_df[HER2_sig_df$expression == -1 ,]
dim(HER2_dn_sig)
#> [1] 197 2
# for more details
?HER2_sig_df
PathAnalyser also contains two built-in gene expression matrices each containing RNA-seq raw read counts for primary breast cancer samples obtained from 20 individuals (cases). Data for these matrices were obtained from The Cancer Genome Atlas (TCGA). Each expression matrix contains 20,124 genes.
data("ER_data_se1")
data("HER2_data_se1")
# column names represent case (sample) IDs from TCGA
The ER data set ER_data_se1 contains RNASeq raw read counts for 20 primary
breast cancer samples, 10 of which have ER pathway activity (ER+) and 10 which
have inactive ER pathway activity (ER-).
dim(ER_data_se1)
#> [1] 20124 20
# Expression data for first 6 genes
head(ER_data_se1)
#> TCGA.AR.A0TY TCGA.AO.A124 TCGA.BH.A0BG TCGA.AN.A0G0 TCGA.AO.A0J8
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 16841 6974 8211 5586 12210
#> TMEM107 189 851 581 433 1298
#> CAMK2A 10 20 18 27 19
#> TRAM1L1 233 198 426 328 1044
#> TCGA.D8.A1XZ TCGA.AO.A0JE TCGA.E9.A5FL TCGA.LL.A8F5 TCGA.C8.A12V
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 8627 9138 3219 7663 6867
#> TMEM107 629 253 226 1265 385
#> CAMK2A 30 47 72 45 18
#> TRAM1L1 385 467 153 342 33
#> TCGA.A2.A0CM TCGA.AC.A3TN TCGA.AC.A3QQ TCGA.A2.A0YF TCGA.E2.A15F
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 4416 2869 718 14649 7106
#> TMEM107 332 660 230 1738 319
#> CAMK2A 29 34 25 4 48
#> TRAM1L1 212 482 28 428 507
#> TCGA.AO.A1KQ TCGA.BH.A0E0 TCGA.BH.A18G TCGA.5L.AAT0 TCGA.A8.A07C
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 6351 3150 5278 4839 9226
#> TMEM107 633 234 245 284 565
#> CAMK2A 32 34 23 43 9
#> TRAM1L1 169 71 220 272 615
Similarly, the HER2 data set HER2_data_se1 contains RNASeq raw read counts for
20 primary breast cancer samples, 10 of which have HER2 (ERBB2) pathway activity
(HER2+) and 10 which have inactive HER2 pathway activity (HER2-).
dim(HER2_data_se1)
#> [1] 20124 20
# Expression data for first 6 genes
head(HER2_data_se1)
#> TCGA.A8.A090 TCGA.BH.A0AU TCGA.A2.A4S2 TCGA.C8.A1HM TCGA.GM.A2DK
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 4404 3614 5882 6380 5080
#> TMEM107 274 171 361 428 352
#> CAMK2A 42 30 21 13 20
#> TRAM1L1 698 514 367 84 265
#> TCGA.E9.A22D TCGA.AN.A0XV TCGA.BH.A0DG TCGA.AN.A041 TCGA.BH.A18P
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 14990 3110 10206 6425 7683
#> TMEM107 1058 742 488 595 239
#> CAMK2A 9 36 50 90 32
#> TRAM1L1 776 693 399 285 510
#> TCGA.AC.A23C TCGA.A8.A0A9 TCGA.AR.A0U3 TCGA.A8.A081 TCGA.BH.A0HQ
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 1 0 0 0
#> TAF7 11091 2000 4663 8107 5103
#> TMEM107 418 715 630 943 774
#> CAMK2A 87 58 6 38 63
#> TRAM1L1 473 598 135 251 253
#> TCGA.D8.A1XO TCGA.A8.A097 TCGA.C8.A1HK TCGA.AC.A23H TCGA.BH.A0BC
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 10467 14549 10020 8397 10158
#> TMEM107 1015 1228 149 367 385
#> CAMK2A 34 117 15 50 26
#> TRAM1L1 544 767 335 128 556
There are two types of input required for pathway activity assessment using PathAnalyser:
The read_expression_data function reads gene expression data from
a file with an delimiter, the function checks the delimiter of the
file and reads the file accordingly. It returns an output with gene symbols/IDs as the row names and columns representing each sample
IDs in the dataset.
data_se <- read_expression_data("/Users/taniyapal/Documents/Group Project/TCGA_unannotated.txt")
data_se<-data_se[,1:200]
print(data_se[1:5, 1:5])
Gene signature files can be read by the read_signature_data function. It returns a dataframe with signature IDs/symbols in the
first column and their expression pattern is represented as up r
regulated or down regulated by +1 and -1 respectively.
sig_df <- read_signature_data("/Users/taniyapal/Documents/Group Project/ESR1_UP.v1._UP.csv", "/Users/taniyapal/Documents/Group Project/ESR1_DN.v1_DN.csv")
head(sig_df)
The log_cpm_transformation function transforms the raw data by the method of log CPM and returns the transformed data. It also performs sanity check of the transformation by returning box plots for visualization of distribution of data before and after transformation.
# using toy data set as expression matrix
data("ER_data_se1")
data_se <- ER_data_se1
normalized_se <- log_cpm_transformation(data_se)
The gene names are first checked for inclusion in the gene signature data frame. Genes that are included in the gene signature are retained then subjected to further filtration, in which their expression across the data set measured. If a gene is not present in at least 10% of the total number of samples in the data set, the gene is dropped from the expression matrix. A console message is printed reporting the number of features (genes) retained in the final normalized expression matrix.
normalized_se <- check_signature_vs_dataset(normalized_se, ER_sig_df)
#> Number of feature present in expression dataset: 146
PathAnalyser provides assessment of pathway activity for each sample in a gene expression data set, by extending an existing unsupervised and non-parametric sample-wise gene set enrichment method, Gene set variation analysis (GSVA). GSVA measures the variation of gene set enrichment across a sample population in a manner independent of class labels, and summarizes the sample-wise expression of the gene set in the form of gene set enrichment scores for each sample (Hänzelmann et al. 2013). However, unlike GSVA, which does not distinguish up-regulated gene sets from down-regulated gene sets, PathAnalyser considers the expression of the up-regulated and down-regulated gene sets of a given pathway’s gene signature both separately and explicitly, using the GSVA method solely for estimating the relative abundance of expression of the up- and down-regulated gene sets for a given pathway in each sample.
The estimated expression abundance for both these gene sets is used to check expression consistency with the gene signature and infer pathway activity for a given sample in the transcriptomic data set.
The classification algorithm implemented in PathAnalyser uses four thresholds for the classification algorithm (two for each gene set of the pathway signature):
The diagram below summarises the classification algorithm employed by PathAnalyser to classify samples by pathway activity. We include an external image with the R function:
Figure 4: Overiew of algorithm diagram
Samples are ordered by expression abundance of up-regulated and down-regulated genes of the gene signature separately using the GSVA scores. Those samples which have the most abundant expression of the up-regulated gene set (exceeding threshold 1) show evidence of activation of the pathway but are only classified as having the pathway “Active”, if the same samples show a low abundance of expression of the down-regulated genes (below threshold 4) of the gene signature i.e. demonstrating consistency between evidence from up- and down-regulated gene sets of the gene signature. Similarly, samples with evidence of pathway inactivation from the up-regulated gene set of the gene signature i.e. the least abundant expression of the up-regulated genes of the gene signature (below threshold 2) and which also show evidence of pathway inactivation from the down-regulated gene set of the signature i.e. most abundance expression of the down-regulated gene set of the gene signature (exceeding threshold 3), demonstrate evidence of consistency of pathway inactivation from both gene sets and therefore are classified as “Inactive”. All other samples are classified as “Uncertain” since there is insufficient evidence for pathway activation or inactivation.
In PathAnalyser, the GSVA scores thresholds for classifying a sample as “Active” (consistent with the gene signature) and “Inactive” (inconsistent with the gene signature) can be set manually and depend on the user’s pathway activity classification requirements.
PathAnalyser provides two functions for classifying samples using an absolute
GSVA scores thresholds classify_GSVA_abs or percentile thresholds which is
effectively rank-based classify_GSVA_percent using percentiles (default at 25%)
, as well as a visualization function to view the distributions of the GSVA
scores for both gene sets gsva_scores_dist, which will be discussed below.
PathAnalyser provides gsva_scores_dist method to visualize the GSVA score
distribution for the abundance of expression of the up-regulated and
down-regulated gene sets of the gene signature for each sample. The resulting
density plot usually follows a bimodal distribution of GSVA scores for each
sample for the up-regulated and down-regulated gene sets, a consequence of GSVA
employing the maximum deviation from zero method to calculate the GSVA score from
the Kolmogorov-Smirnov like random walk statistic.
Figure 5 shows two peaks for each gene set when running
gsva_scores_dist with logCPM-normalized ER_data_se1 gene expression matrix. For
both gene sets, the peak with the highest GSVA scores represents samples with
high abundance of expression of the gene set and the lower score peak
represents samples with low abundance of expression of the gene set.
gsva_scores_dist(ER_sig_df, normalized_se)
Figure 5: Density plots showing the distribution of GSVA scores for samples in logCPM-normalised ER_data_se1 data set for the up-regulated (left) and down-regulated gene set (right) of the pathway signature using gsva_scores_dist function
As the two peaks represent represent low expression abundance and high abundance expression, the distribution of GSVA scores that constitute the “valley” i.e. the area between the two peaks (fig. 6), represent samples that exhibit relatively weaker evidence of differential expression abundance relative to other samples and the algorithm would classify these samples as having “Uncertain” pathway activity. Because the peaks represent the mode GSVA scores (one for high and low expression abundance), thresholds should be selected between these two peak to enable the algorithm to classify samples around these modes as being consistent or inconsistent with the pathway signature.
As mentioned above, there are four thresholds that can be tuned by the user: the high and low expression abundance for the up-regulated and down-regulated gene sets. For example, a user could set the high expression abundance thresholds for the up-regulated gene set to -0.2 for the low expression threshold for both gene sets, and 0.2 for the high expression thresholds for both gene sets. Samples with GSVA scores between the these thresholds for both the up-regulated and down -regulated gene-set would be considered by the classification algorithm as having “Uncertain” pathway activity.
plot <- gsva_scores_dist(ER_sig_df, normalized_se)
# Add thresholds on plot
library(ggplot2)
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.2, 0.2))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
Figure 6: GSVA scores distributions of the samples for the down-regulated (left) and up-regulated gene set (right) of the pathway gene signature, showing absolute thresholds of -0.2 and 0.2 for characterising low and high expression abundance of both gene sets respectively
The expression data set used is the logCPM-normalised ER_data_se1 gene expression data set.
Shrinking the distance between the high and low expression thresholds would result in fewer “Uncertain” pathway activity classifications (fig. 7), because more samples would meet the consistency/ inconsistency expression thresholds for the pathway signature (high expression thresholds would be lower, low expression thresholds would be higher for each gene set).
# more relaxed thresholds, fewer uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"),
vline=c(-0.1, 0.1))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
Figure 7: GSVA scores distributions of samples with more relaxed thresholds for assessing expression consistency of the down-regulated gene set (left) and the up-regulated gene set (right) with the pathway gene expression signature
Low and high expression abundance thresholds for the down-regulated gene set (left) are set to -0.1 and 0.1 respectively, while low and high expression abundance thresholds for the up-regulated gene set (right) are also set to -0.1 and 0.1 respectively. Data is generated from running GSVA on logCPM-normalised ER_data_se1 expression dataset with ER signature (ER_sig_df).
Conversely, expanding the distance between the thresholds would increase the frequency of “Uncertain” pathway activity classifications, as the number of samples meeting the thresholds for consistency and inconsistency would be reduced.
# more stringent thresholds, greater uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.4, 0.4))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
Figure 8: Density plots for GSVA scores distributions for samples with relatively more stringent thresholds for assessing consistency of the up-regulated gene set (left) and down-rgulated gene set (right) with the pathway signature
The low and high expression abundance thresholds are -0.4 and 0.4 for both up-regulated and down-regulated gene sets respectively. Data is generated from running GSVA on logCPM-normalised ER_data_se1 expression dataset with ER signature (ER_sig_df).
As the distribution of GSVA scores tends to be biomodal rather than Guassian,
the user may prefer to use absolute GSVA thresholds for checking consistency of
expression abundance of each gene set with pathway signature for each sample.
PathAnalyser provides the classify_GSVA_abs function for classifying samples
by pathway activity using absolute GSVA score thresholds which can be tuned by
the user:
up_thresh.high: for high expression abundance for the
up-regulated gene-setup_thresh.low: for low expression abundance for the
up-regulated gene-set of the gene signaturedn_thresh.high: for high expression abundance for the
down-regulated gene-set of the gene signaturedn_thresh.low: for low expression abundance for the
down-regulated gene-set of the gene signatureNote that absolute thresholds are required from the user when running
classify_GSVA_abs and can be positive or negative numbers. A summary of the
classification is printed to the console highlighting the number of samples in
each pathway activity class (“Active”, “Inactive” and “Uncertain”) and the
number of samples in total.
classes_df <- classify_GSVA_abs(ER_sig_df, normalized_se,
up_thresh.low=-0.2, up_thresh.high=0.2,
dn_thresh.low=-0.2, dn_thresh.high=0.2)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 9 10 1
#>
#> Total number of samples: 20
#> Total number of samples classified: 19
The package also provides a GSVA-based classification method
classify_GSVA_percent, which uses percentile ranks as thresholds rather than
absolute GSVA thresholds. The default percentile used by the function is 25%, so
the high expression abundance thresholds in both gene sets would be 75%, by
default and low expression abundance thresholds in both gene sets would be 25%,
default. A custom percentile threshold can be provided using the optional
thresh_percent parameter. As for the absolute threshold function, a summary of
the classification is printed to the console highlighting the number of samples
in each pathway activity class (“Active”, “Inactive” and “Uncertain”) and the
number of samples in total. Increasing the percentile threshold for
classification has the equivalent effect of reducing the distance between the
high expression abundance and low expression abundance thresholds for both gene
sets, therefore reducing the frequency of “Uncertain” classifications.
# default percentile = 25% (quartile)
classes_df.percent <- classify_GSVA_percent(ER_sig_df, normalized_se)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 4 5 11
#>
#> Total number of samples: 20
#> Total number of samples classified: 9
# custom percentile = 30%
classes_df.percent <- classify_GSVA_percent(ER_sig_df, normalized_se,
percent_thresh=30)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 6 6 8
#>
#> Total number of samples: 20
#> Total number of samples classified: 12
An interactive PCA plot providing a visual summary of the pathway-based
classification can be produced by using the classes_pca function with the
normalized data set and predicted classes data frame. Each point represent a
sample in the expression data set and are colored according to the predicted
activity class (“Active”, “Inactive”, “Uncertain”). Hovering over the sample
points displays the sample name, along with predicted pathway activity class and
also the relative coordinates of the sample in the first two principal
components.
classes_pca(normalized_se, classes_df, pathway_name = "ER")
This package provides a method calculate_accuracy for evaluating pathway
activity classification, which gets the actual labels and the labels classified
by the classification methods as parameters; then creates confusion matrix based
on them. The first parameter true_labels_source could be file-name, matrix or
data frame and contains sample names and their corresponding actual
labels.The second parameter classes_df could be matrix or data frame and
contains classified labels found by the classification methods
classify_GSVA_percent or classify_GSVA_abs. Third parameter pathway
specifies with which pathway the labels(actual and classified) are associated.
Other parameters display_statistics and display_roc_curve are boolean flags
used as optional features to display statistics and ROC Curve respectively.
Default values for the flags is FALSE.
true_labels_df <- read.table("Sample_labels.txt", sep="\t", header=T)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df,
pathway = "ER", display_statistics=TRUE,
display_roc_curve=TRUE)
#> [1] "Confusion Matrix for ER pathway"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive 9 0 0
#> Prediction Negative 1 9 0
#> Prediction Uncertain 0 1 0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 95.00"
#> [1] "Accuracy amongst classified samples: 94.74"
#> [1] "True Positive(TP): 9"
#> [1] "True Negative(TN): 9"
#> [1] "False Negative(FN): 1"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 90.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 10.00"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Min. :0.000 Min. :0.000 Min. :0
#> 1st Qu.:0.500 1st Qu.:0.500 1st Qu.:0
#> Median :1.000 Median :1.000 Median :0
#> Mean :3.333 Mean :3.333 Mean :0
#> 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:0
#> Max. :9.000 Max. :9.000 Max. :0
To demonstrate the standard workflow of assessing pathway activity in a transcriptomic data set using PathAnalyser, we will use PathAnalyser to classify samples in an RNA seq data set according to ER pathway activity as (“Active”, “Inactive” or “Uncertain”) for the ER pathway. The ER pathway status of tumors is of particular interest since, activation of the pathway is correlated sensitivity to endocrine therapy and disease prognosis more broadly. The data set collected from The Genome Atlas Project (TCGA) contains raw read counts for over 20,000 genes (almost the complete human transcriptome) for 20 primary breast cancer samples. Ten samples were shown have the ER pathway active (ER+), while the remaining ten samples were reported to be inactive for the ER pathway (ER-).
First, the gene expression matrix data set text file which contains raw read counts for the 20 breast cancer samples and ER gene signature text files (up-regulated and down-regulated gene set files) are read into the expression matrix and gene signature data frame data types respectively. The gene expression matrix contains raw read counts for 20,124 genes and 20 samples. Additionally, the ER signature compiled from the Sensitivity to Endocrine Therapy (SET) index proposed by Symanns et al. (2011) contains 160 genes of which 59 are down-regulated and 101 are up-regulated.
# Load transcriptomic data set (gene expression matrix of samples)
data_se <- read_expression_data("toy_data.txt")
dim(data_se)
#> [1] 20124 20
head(data_se)
#> TCGA.AR.A0TY TCGA.AO.A124 TCGA.BH.A0BG TCGA.AN.A0G0 TCGA.AO.A0J8
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 16841 6974 8211 5586 12210
#> TMEM107 189 851 581 433 1298
#> CAMK2A 10 20 18 27 19
#> TRAM1L1 233 198 426 328 1044
#> TCGA.D8.A1XZ TCGA.AO.A0JE TCGA.E9.A5FL TCGA.LL.A8F5 TCGA.C8.A12V
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 8627 9138 3219 7663 6867
#> TMEM107 629 253 226 1265 385
#> CAMK2A 30 47 72 45 18
#> TRAM1L1 385 467 153 342 33
#> TCGA.A2.A0CM TCGA.AC.A3TN TCGA.AC.A3QQ TCGA.A2.A0YF TCGA.E2.A15F
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 4416 2869 718 14649 7106
#> TMEM107 332 660 230 1738 319
#> CAMK2A 29 34 25 4 48
#> TRAM1L1 212 482 28 428 507
#> TCGA.AO.A1KQ TCGA.BH.A0E0 TCGA.BH.A18G TCGA.5L.AAT0 TCGA.A8.A07C
#> OR5H15 0 0 0 0 0
#> SPANXN2 0 0 0 0 0
#> TAF7 6351 3150 5278 4839 9226
#> TMEM107 633 234 245 284 565
#> CAMK2A 32 34 23 43 9
#> TRAM1L1 169 71 220 272 615
# read signature data from the two individual gene set files for up-regulated
# and down-regulated gene sets
ER_sig <- read_signature_data("ESR1_UP.v1._UP.grp", "ESR1_DN.v1_DN.grp")
dim(ER_sig)
#> [1] 160 2
As the gene expression matrix file contained raw RNASeq counts, the data must be
normalized prior to passing the matrix to any of the classification functions.
Normalization of read counts is necessary to account for differences in
sequencing depths among the libraries. The PathAnalyser log_cpm_transformation
function can be used to perform a log counts per million (CPM) transformation on
the gene expression matrix. Furthermore, log_cpm graphically confirms the
logCPM transformation has been performed correctly since the box plots
displayed by the function, representing the logCPM values for each sample, are
more aligned following the logCPM transformation compared to before the
transformation.
# Transforming matrix with log cpm transformation and sanity check of the transformation
normalized_se <- log_cpm_transformation(data_se)
To ensure that the gene names in the gene signatures and gene
expression data set are consistent, we use the
check_signature_vs_dataset gene
names to filter out genes from the gene expression matrix that are not in
signature and those genes which are not expressed in at least 10% of samples.
The output of running the function on the normalised dataset and ER signature
shows that 147 genes were retained in the normalised dataset with 13 genes being
dropped either due to expression in less than 10% of samples, or due to being
absent in the ER signature.
normalized_se <- check_signature_vs_dataset(normalized_se, ER_sig)
#> Number of feature present in expression dataset: 146
dim(normalized_se)
#> [1] 146 20
Once the expression matrix is normalised to logCPM and genes with insufficient
counts or those that do not feature in the ER signature are removed, the
expression data set is ready for classification alongside the ER signature. To
acquire an idea of the distribution of GSVA scores, the gene expression matrix
can be passed as an argument along with the ER signature to the
gsva_scores_dist. A bimodal density plot is produced for the up-regulated and
down-regulated gene set of the ER signature, as described before reflective of
the underlying maximum deviation from zero method used to generate the GSVA
score statistic for representing expression abundance of the corresponding gene
set of the ER signature. The peak centered around the GSVA score of -0.8 for the
down-regulated gene set and the peak around 0.8 for the down-regulated gene set
represent the mode of samples with the low expression abundance of the
down-regulated genes of the ER signature and high abundance of the up-regulated
genes of the ER signature respectively. Similarly, for the up-regulated genes of
the ER pathway signature, the peak situated at around a GSVA score of -0.7 and
and around the GSVA score of 0.9 represent samples of low expression abundance
and high expression abundance of the up-regulated genes of the ER pathway
signature respectively.
gsva_scores_dist(ER_sig, normalized_se)
Samples in a gene expression matrix can be classified according to evidence of
expression consistency with the up-regulated and down-regulated gene sets of the
ER signature using GSVA by using absolute thresholds for high and low expression
for each gene set of the ER signature via
classify_GSVA_abs or percentile
threshold for high and low expression for each gene set of the ER signature via
classify_GSVA_percent. Here, we use the default percentile thresholds of 25%
to classify samples as having high or low expression abundance with the
up-regulated and down-regulated gene sets of the ER gene signature. The R
console output of running the classify_GSVA_percent on our data set and
ER signature, shows that of the 20 samples in our data set, 4 samples were
classified as active and 5 samples were classified as inactive for the ER
pathway. Eleven samples were classified as uncertain. A data frame is produced
with names of the 20 samples as the first column and their corresponding
predicted ER pathway activity classes as the second column.
classes_df <- classify_GSVA_percent(ER_sig, normalized_se)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 4 5 11
#>
#> Total number of samples: 20
#> Total number of samples classified: 9
head(classes_df)
#> sample class
#> 1 TCGA.AC.A3TN Active
#> 2 TCGA.AO.A0J8 Active
#> 3 TCGA.A2.A0YF Active
#> 4 TCGA.5L.AAT0 Active
#> 5 TCGA.E2.A15F Uncertain
#> 6 TCGA.AO.A1KQ Uncertain
Depending on the user’s classification requirements the thresholds can be adjusted. For example, if the user would like to reduce the number of “Uncertain” classification samples, the user could provide 50% percentile threshold as an argument e.g.:
classes_df_50p <- classify_GSVA_percent(ER_sig, normalized_se,
percent_thresh=50)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 10 10 0
#>
#> Total number of samples: 20
#> Total number of samples classified: 20
From the output of running the classification function with 50% percentile, it is evident that no samples were classified as “Uncertain”, with 10 samples classified as “Active” and 10 samples classified as “Inactive” by the classification algorithm.
Alternatively, the breast cancer samples in the data set could be classified by ER pathway activity, using absolute GSVA score thresholds for the up-regulated and down-regulated gene sets of the ER signature respectively.
classes_df_abs <- classify_GSVA_abs(ER_sig, normalized_se, up_thresh.low = -0.2,
up_thresh.high = 0.2, dn_thresh.low = -0.2,
dn_thresh.high = 0.2)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#>
#> Active Inactive Uncertain
#> 9 10 1
#>
#> Total number of samples: 20
#> Total number of samples classified: 19
To acquire a visual summary of the ER pathway-based classification of our 20
breast cancer samples, we can run the classes_pca function to produce an
interactive PCA plot of the samples coloured by the predicted ER pathway
activity status (active, inactive, uncertain).
classes_pca(normalized_se, classes_df, pathway_name = "ER")
The PCA plot shows the samples predicted as active and inactive form relatively tight clusters in the first principal component (PC), as would be expected since the inter-variation within the active and inactive classes would be smaller than between the different classes. Interestingly, the active and inactive clusters are in opposite quadrants in the first PC indicating the samples have been separated according to pathway activity well in the first PC. Furthermore, samples classified as uncertain are more scattered in the first PC, with a few samples present in the active / inactive clusters. Overall the PCA plot shows the samples have clustered well according to the predicted pathway activity classes. Sample names and their corresponding predicted pathway activity class can be displayed for a given sample on the plot by hovering over a sample point using a mouse pointer.
A more detailed assessment of the ER pathway based classification of the breast
cancer samples can be performed by providing the true pathway activity class
labels for each sample, along with predicted pathway activity classes
classes_df to the calculate_accuracy function. These “true labels” must only
contain (“Active”, “Inactive or”Uncertain”) for the calculate_accuracy
function to correctly assess the pathway-based classification. The pathway
parameter is set to “ER” since ER pathway activity is being assessed. The
function outputs a confusion matrix reporting in tabular form, the number of
true positives (TP), true negative(TN), false positives (FP), false negatives
(FN) and also a breakdown of the number of uncertain classifications and their
actual pathway activity status. Further classification evaluation using the
display_statistics reveals an accuracy of 100% of among the 9 classified
samples, the 4 samples classified as ER positive were actually ER positive and
similarly, the 5 samples classified as ER negative were indeed ER negative.
# read true pathway activity class labels for data set samples
true_labels_df <- read.table("Sample_labels.txt",
header=TRUE, sep = "\t")
# assess accuracy and generate confusion matrix for classification
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df,
pathway = "ER")
#> [1] "Confusion Matrix for ER pathway"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive 4 0 0
#> Prediction Negative 0 5 0
#> Prediction Uncertain 6 5 0
#> [1] "--------------------------------------------------------------"
# detail breakdown of classification evaluation (accuracy, % classified etc)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df,
pathway = "ER", display_statistics = T)
#> [1] "Confusion Matrix for ER pathway"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive 4 0 0
#> Prediction Negative 0 5 0
#> Prediction Uncertain 6 5 0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 45.00"
#> [1] "Accuracy amongst classified samples: 100.00"
#> [1] "True Positive(TP): 4"
#> [1] "True Negative(TN): 5"
#> [1] "False Negative(FN): 0"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 100.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 0.00"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Min. :0.000 Min. :0.000 Min. :0
#> 1st Qu.:2.000 1st Qu.:2.500 1st Qu.:0
#> Median :4.000 Median :5.000 Median :0
#> Mean :3.333 Mean :3.333 Mean :0
#> 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:0
#> Max. :6.000 Max. :5.000 Max. :0
In addition to a detail summary of the classification assessment, we can also view a receiver operating curve (ROC) curve for our classification, which visualizes the sensitivity and specificity of the data set classification by fitting a logistic regression model to the binary ER pathway activity classifications (active, inactive):
# detail breakdown of classification evaluation (accuracy, % classified etc)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df,
pathway = "ER", display_statistics = T,
display_roc_curve = T)
#> [1] "Confusion Matrix for ER pathway"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive 4 0 0
#> Prediction Negative 0 5 0
#> Prediction Uncertain 6 5 0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 45.00"
#> [1] "Accuracy amongst classified samples: 100.00"
#> [1] "True Positive(TP): 4"
#> [1] "True Negative(TN): 5"
#> [1] "False Negative(FN): 0"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 100.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 0.00"
#> [1] "--------------------------------------------------------------"
#> Actual Positive Actual Negative Actual Uncertain
#> Min. :0.000 Min. :0.000 Min. :0
#> 1st Qu.:2.000 1st Qu.:2.500 1st Qu.:0
#> Median :4.000 Median :5.000 Median :0
#> Mean :3.333 Mean :3.333 Mean :0
#> 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:0
#> Max. :6.000 Max. :5.000 Max. :0
In concordance with the statistical classification evaluation metrics, the ROC
curve also shows that the samples were classified well according to ER pathway
activity by the curve reaching the top left corner of the plot (indicating a
high true positive rate and a low false positive rate). The nine points
represent the nine samples classified as “active” / “inactive” for the ER
pathway.
The output of sessionInfo upon which this file was generated:
sessionInfo()
#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.3.5 PathAnalyser_0.0.0.9000 GSVA_1.42.0
#> [4] BiocStyle_2.22.0
#>
#> loaded via a namespace (and not attached):
#> [1] plyr_1.8.7 lazyeval_0.2.2
#> [3] GSEABase_1.56.0 crosstalk_1.2.0
#> [5] BiocParallel_1.28.3 usethis_2.1.5
#> [7] GenomeInfoDb_1.30.1 digest_0.6.29
#> [9] htmltools_0.5.2 fansi_1.0.3
#> [11] magrittr_2.0.3 memoise_2.0.1
#> [13] ScaledMatrix_1.2.0 NCmisc_1.1.6
#> [15] limma_3.50.1 remotes_2.4.2
#> [17] Biostrings_2.62.0 annotate_1.72.0
#> [19] matrixStats_0.61.0 prettyunits_1.1.1
#> [21] colorspace_2.0-3 blob_1.2.2
#> [23] xfun_0.30 dplyr_1.0.8
#> [25] callr_3.7.0 crayon_1.5.1
#> [27] RCurl_1.98-1.6 jsonlite_1.8.0
#> [29] graph_1.72.0 glue_1.6.2
#> [31] gtable_0.3.0 zlibbioc_1.40.0
#> [33] XVector_0.34.0 DelayedArray_0.20.0
#> [35] pkgbuild_1.3.1 BiocSingular_1.10.0
#> [37] Rhdf5lib_1.16.0 SingleCellExperiment_1.16.0
#> [39] BiocGenerics_0.40.0 HDF5Array_1.22.1
#> [41] scales_1.1.1 futile.options_1.0.1
#> [43] DBI_1.1.2 edgeR_3.36.0
#> [45] Rcpp_1.0.8.3 viridisLite_0.4.0
#> [47] xtable_1.8-4 bit_4.0.4
#> [49] rsvd_1.0.5 stats4_4.1.3
#> [51] htmlwidgets_1.5.4 httr_1.4.2
#> [53] RColorBrewer_1.1-3 ellipsis_0.3.2
#> [55] pkgconfig_2.0.3 XML_3.99-0.9
#> [57] farver_2.1.0 sass_0.4.1
#> [59] locfit_1.5-9.5 utf8_1.2.2
#> [61] tidyselect_1.1.2 labeling_0.4.2
#> [63] rlang_1.0.2 reshape2_1.4.4
#> [65] AnnotationDbi_1.56.2 munsell_0.5.0
#> [67] tools_4.1.3 cachem_1.0.6
#> [69] cli_3.2.0 generics_0.1.2
#> [71] RSQLite_2.2.12 devtools_2.4.3
#> [73] evaluate_0.15 stringr_1.4.0
#> [75] fastmap_1.1.0 yaml_2.3.5
#> [77] processx_3.5.3 knitr_1.38
#> [79] bit64_4.0.5 fs_1.5.2
#> [81] purrr_0.3.4 KEGGREST_1.34.0
#> [83] sparseMatrixStats_1.6.0 formatR_1.12
#> [85] proftools_0.99-3 brio_1.1.3
#> [87] compiler_4.1.3 rstudioapi_0.13
#> [89] plotly_4.10.0 png_0.1-7
#> [91] testthat_3.1.3 tibble_3.1.6
#> [93] bslib_0.3.1 stringi_1.7.6
#> [95] highr_0.9 ps_1.6.0
#> [97] futile.logger_1.4.3 desc_1.4.1
#> [99] lattice_0.20-45 Matrix_1.4-1
#> [101] vctrs_0.4.0 pillar_1.7.0
#> [103] lifecycle_1.0.1 rhdf5filters_1.6.0
#> [105] BiocManager_1.30.16 jquerylib_0.1.4
#> [107] data.table_1.14.2 bitops_1.0-7
#> [109] irlba_2.3.5 GenomicRanges_1.46.1
#> [111] R6_2.5.1 bookdown_0.25
#> [113] IRanges_2.28.0 sessioninfo_1.2.2
#> [115] reader_1.0.6 lambda.r_1.2.4
#> [117] assertthat_0.2.1 pkgload_1.2.4
#> [119] rhdf5_2.38.1 SummarizedExperiment_1.24.0
#> [121] rprojroot_2.0.3 withr_2.5.0
#> [123] S4Vectors_0.32.4 GenomeInfoDbData_1.2.7
#> [125] parallel_4.1.3 VennDiagram_1.7.1
#> [127] grid_4.1.3 beachmat_2.10.0
#> [129] tidyr_1.2.0 rmarkdown_2.13
#> [131] DelayedMatrixStats_1.16.0 MatrixGenerics_1.6.0
#> [133] pROC_1.18.0 Biobase_2.54.0
Hänzelmann, S., Castelo, R. & Guinney, J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14, 7 (2013). https://doi.org/10.1186/1471-2105-14-7
Symmans, W.F., Hatzis, C., Sotiriou, C., Andre, F., Peintinger, F., Regitnig, P., Daxenbichler, G., Desmedt, C., Domont, J., Marth, C. and Delaloge, S., 2010. Genomic index of sensitivity to endocrine therapy for breast cancer. Journal of clinical oncology, 28(27), p.4111. https://dx.doi.org/10.1200%2FJCO.2010.28.4273